<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>Berkeley DB recoverability</title>
    <link rel="stylesheet" href="gettingStarted.css" type="text/css" />
    <meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
    <link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
    <link rel="up" href="transapp.html" title="Chapter 12.  Berkeley DB Transactional Data Store Applications" />
    <link rel="prev" href="transapp_filesys.html" title="Recovery and filesystem operations" />
    <link rel="next" href="transapp_tune.html" title="Transaction tuning" />
  </head>
  <body>
    <div xmlns="" class="navheader">
      <div class="libver">
        <p>Library Version 12.1.6.2</p>
      </div>
      <table width="100%" summary="Navigation header">
        <tr>
          <th colspan="3" align="center">Berkeley DB recoverability</th>
        </tr>
        <tr>
          <td width="20%" align="left"><a accesskey="p" href="transapp_filesys.html">Prev</a> </td>
          <th width="60%" align="center">Chapter 12.  Berkeley DB Transactional Data Store Applications </th>
          <td width="20%" align="right"> <a accesskey="n" href="transapp_tune.html">Next</a></td>
        </tr>
      </table>
      <hr />
    </div>
    <div class="sect1" lang="en" xml:lang="en">
      <div class="titlepage">
        <div>
          <div>
            <h2 class="title" style="clear: both"><a id="transapp_reclimit"></a>Berkeley DB recoverability</h2>
          </div>
        </div>
      </div>
      <p> 
        Berkeley DB recovery is based on write-ahead logging. This
        means that when a change is made to a database page, a
        description of the change is written into a log file. This
        description in the log file is guaranteed to be written to
        stable storage before the database pages that were changed are
        written to stable storage. This is the fundamental feature of
        the logging system that makes durability and rollback work. 
    </p>
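      <p>
        The ordering rule can be sketched with plain POSIX calls. This
        is a minimal illustration, not Berkeley DB's implementation:
        the names wal_update and write_all, the text log record, and
        the file layout are all assumptions made for the example. The
        only point is the order of operations: the log record is
        forced to stable storage before the changed page is written.
      </p>

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes at offset off, retrying short writes. */
static int write_all(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0)
            return -1;
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Apply one change using write-ahead ordering (illustrative only):
 *   1. append a description of the change to the log,
 *   2. force the log to stable storage with fsync,
 *   3. only then write the changed page, without forcing it.
 */
int wal_update(int log_fd, int db_fd, off_t page_off,
               const void *page, size_t page_size,
               const char *log_record)
{
    off_t log_end;

    assert(page != NULL && log_record != NULL && page_size > 0);

    log_end = lseek(log_fd, 0, SEEK_END);
    if (log_end < 0)
        return -1;
    if (write_all(log_fd, log_record, strlen(log_record), log_end) != 0)
        return -1;
    if (fsync(log_fd) != 0)     /* the log record is stable first */
        return -1;

    /* The page itself may reach disk lazily, at any later time. */
    return write_all(db_fd, page, page_size, page_off);
}
```

      <p>
        The page write is deliberately not followed by its own fsync.
        That is safe only because the log record describing the change
        is already on stable storage.
      </p>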
      <p> 
        If the application or system crashes, the log is reviewed
        during recovery. Any database changes described in the log
        that were part of committed transactions and that were never
        written to the actual database itself are written to the
        database as part of recovery. Any database changes described
        in the log that were never committed and that were written to
        the actual database itself are backed out of the database as
        part of recovery. This design allows the database to be
        written lazily, and only blocks from the log file have to be
        forced to disk as part of transaction commit.
    </p>
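      <p>
        The two recovery rules above can be illustrated with a toy
        model. The record layout, the integer "slots" standing in for
        database pages, and the function names are hypothetical and
        far simpler than Berkeley DB's real log format. The sketch
        only shows that uncommitted changes are backed out and
        committed changes are reapplied.
      </p>

```c
#include <assert.h>
#include <stddef.h>

enum rec_type { REC_UPDATE, REC_COMMIT };

/* Hypothetical log record; slot, before, after are used by REC_UPDATE only. */
struct log_rec {
    enum rec_type type;
    int txn;        /* transaction id */
    size_t slot;    /* which database slot was changed */
    int before;     /* value before the change */
    int after;      /* value after the change */
};

/* A transaction is committed if the log contains its commit record. */
static int is_committed(const struct log_rec *log, size_t n, int txn)
{
    for (size_t i = 0; i < n; i++)
        if (log[i].type == REC_COMMIT && log[i].txn == txn)
            return 1;
    return 0;
}

void toy_recover(int *db, size_t nslots, const struct log_rec *log, size_t n)
{
    assert(db != NULL && log != NULL);

    /* Backward pass: back out changes from transactions that never committed. */
    for (size_t i = n; i-- > 0; ) {
        if (log[i].type == REC_UPDATE && !is_committed(log, n, log[i].txn)) {
            assert(log[i].slot < nslots);
            db[log[i].slot] = log[i].before;
        }
    }

    /* Forward pass: reapply changes from committed transactions. */
    for (size_t i = 0; i < n; i++) {
        if (log[i].type == REC_UPDATE && is_committed(log, n, log[i].txn)) {
            assert(log[i].slot < nslots);
            db[log[i].slot] = log[i].after;
        }
    }
}
```

      <p>
        Both passes are idempotent in this model: running toy_recover
        a second time leaves the slots unchanged, which is what allows
        the database itself to be written lazily.
      </p>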
      <p> 
        There are two interfaces that are a concern when
        considering Berkeley DB recoverability: 
    </p>
      <div class="orderedlist">
        <ol type="1">
          <li>
            The interface between Berkeley DB and the operating
            system/filesystem.
        </li>
          <li>
            The interface between the operating
            system/filesystem and the underlying stable storage
            hardware.
        </li>
        </ol>
      </div>
      <p> 
        Berkeley DB uses the operating system interfaces and its
        underlying filesystem when writing its files. This means that
        Berkeley DB can fail if the underlying filesystem fails in
        some unrecoverable way. Otherwise, the interface requirements
        here are simple: the system call that Berkeley DB uses to
        flush data to disk (normally fsync or fdatasync) must
        guarantee that all the information necessary for a file's
        recoverability has been written to stable storage before it
        returns to Berkeley DB, and that no possible application or
        system crash can cause that file to be unrecoverable.
    </p>
      <p> 
        In addition, Berkeley DB implicitly uses the interface
        between the operating system and the underlying hardware. The
        interface requirements here are not as simple.
    </p>
      <p> 
        First, it is necessary to consider the underlying page size
        of the Berkeley DB databases. The Berkeley DB library performs
        all database writes using the page size specified by the
        application, and Berkeley DB assumes pages are written
        atomically. This means that if the operating system performs
        filesystem I/O in blocks of different sizes than the database
        page size, it may increase the possibility of database
        corruption. For example, assume that Berkeley DB is writing
        32KB pages for a database, and the operating system does
        filesystem I/O in 16KB blocks. If the operating system writes
        the first 16KB of the database page successfully, but crashes
        before being able to write the second 16KB of the page,
        the database has been corrupted and this corruption may or may
        not be detected during recovery. For this reason, it may be
        important to select database page sizes that will be written
        as single block transfers by the underlying operating system.
        If you do not select a page size that the underlying operating
        system will write as a single block, you may want to configure
        the database to use checksums (see the DB_CHKSUM flag of the <a href="../api_reference/C/dbset_flags.html" class="olink">DB-&gt;set_flags()</a> method for
        more information). By configuring checksums, you guarantee
        this kind of corruption will be detected at the expense of the
        CPU required to generate the checksums. When such an error is
        detected, the only course of recovery is to perform
        catastrophic recovery to restore the database. 
    </p>
      <p> 
        Second, if you are copying database files (either as part
        of doing a hot backup or creation of a hot failover area),
        there is an additional question related to the page size of
        the Berkeley DB databases. You must copy databases atomically,
        in units of the database page size. In other words, the reads
        made by the copy program must not be interleaved with writes
        by other threads of control, and the copy program must read
        the databases in multiples of the underlying database page
        size. On Unix systems, this is not a problem, as these
        operating systems already make this guarantee and system
        utilities normally read in power-of-2 sized chunks, which are
        larger than the largest possible Berkeley DB database page
        size. Other operating systems, particularly those based on
        Linux and Windows, do not provide this guarantee, and hot
        backups must not be performed on these systems by reading data
        from the file system. The <a href="../api_reference/C/db_hotbackup.html" class="olink">db_hotbackup</a> utility should be used on
        these systems.
    </p>
      <p>
        An additional problem we have seen in this area was in some
        releases of Solaris where the cp utility was implemented using
        the mmap system call rather than the read system call. Because
        the Solaris mmap system call did not make the same guarantee
        of read atomicity as the read system call, using the cp
        utility could create corrupted copies of the databases.
        Another problem we have seen is implementations of the tar
        utility doing 10KB block reads by default, and even when an
        output block size was specified to that utility, not reading
        from the underlying databases in multiples of the block size.
        Using the dd utility instead of the cp or tar utilities (and
        specifying an appropriate block size) fixes these problems.
        If you plan to use a system utility to copy database files,
        you may want to use a system call trace utility (for example,
        ktrace or truss) to check for I/O sizes that are smaller than,
        or not a multiple of, the database page size, and for system
        calls other than read.
    </p>
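      <p>
        The copy discipline described above (the read system call
        only, in multiples of the database page size) can be sketched
        as follows. The function name and parameters are hypothetical.
        The sketch assumes a system on which read is atomic with
        respect to concurrent writes; where that guarantee is absent,
        this code is no safer than cp and the db_hotbackup utility
        should be used instead.
      </p>

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Copy src to dst using read() calls whose size is a whole number of
 * database pages. Returns 0 on success, -1 on any error.
 * Note: a short read in the middle of a regular file would leave later
 * reads unaligned; this sketch does not try to detect that case.
 */
int copy_in_page_units(const char *src, const char *dst,
                       size_t page_size, size_t pages_per_read)
{
    assert(page_size > 0 && pages_per_read > 0);

    size_t chunk = page_size * pages_per_read;
    char *buf = malloc(chunk);
    if (buf == NULL)
        return -1;

    int in = open(src, O_RDONLY);
    int out = in < 0 ? -1 : open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    int rc = (in < 0 || out < 0) ? -1 : 0;

    while (rc == 0) {
        ssize_t n = read(in, buf, chunk);   /* one read per chunk of pages */
        if (n == 0)
            break;                          /* end of file */
        if (n < 0) {
            rc = -1;
            break;
        }
        for (ssize_t done = 0; done < n; ) {
            ssize_t w = write(out, buf + done, (size_t)(n - done));
            if (w < 0) {
                rc = -1;
                break;
            }
            done += w;
        }
    }

    if (rc == 0 && fsync(out) != 0)         /* make the copy durable */
        rc = -1;
    if (in >= 0)
        close(in);
    if (out >= 0)
        close(out);
    free(buf);
    return rc;
}
```

      <p>
        For example, copy_in_page_units("a.db", "backup/a.db", 8192, 16)
        would copy a database with 8KB pages using 128KB reads. The
        file names and sizes here are placeholders.
      </p>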
      <p> 
        Third, it is necessary to consider the behavior of the
        system's underlying stable storage hardware. For example,
        consider a SCSI controller that has been configured to cache
        data and report to the operating system that the data has been
        written to stable storage, when, in fact, it has only been
        written into the controller RAM cache. If power is lost before
        the controller is able to flush its cache to disk, and the
        controller cache is not stable (that is, the writes will not
        be flushed to disk when power returns), the writes will be
        lost. If the writes include database blocks, there is no loss
        because recovery will correctly update the database. If the
        writes include log file blocks, it is possible that
        transactions that were already committed may not appear in the
        recovered database, although the recovered database will be
        coherent after a crash.
    </p>
      <p> 
        If the underlying hardware can fail in any way so that only
        part of the block was written, the failure conditions are the
        same as those described previously for an operating system
        failure that writes only part of a logical database block. In
        such cases, configuring the database for checksums will ensure
        the corruption is detected. 
    </p>
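      <p>
        To see why a per-page checksum detects a partially written
        page, consider the following sketch. It stores an Adler-32
        checksum in the first four bytes of a 32KB page. This layout
        and algorithm are assumptions chosen for illustration and are
        not a description of the checksum Berkeley DB computes when
        checksums are configured.
      </p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 32768u
#define SKETCH_SUM_BYTES 4u

/* Plain Adler-32 over len bytes (illustrative choice of checksum). */
static uint32_t sketch_adler32(const unsigned char *p, size_t len)
{
    uint32_t a = 1, b = 0;

    for (size_t i = 0; i < len; i++) {
        a = (a + p[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}

/* Compute the checksum of the page body and store it in the page header. */
void page_seal(unsigned char *page)
{
    uint32_t sum;

    assert(page != NULL);
    sum = sketch_adler32(page + SKETCH_SUM_BYTES,
                         SKETCH_PAGE_SIZE - SKETCH_SUM_BYTES);
    memcpy(page, &sum, SKETCH_SUM_BYTES);
}

/* Return 1 if the stored checksum matches the page body, 0 otherwise. */
int page_verify(const unsigned char *page)
{
    uint32_t stored;

    assert(page != NULL);
    memcpy(&stored, page, SKETCH_SUM_BYTES);
    return stored == sketch_adler32(page + SKETCH_SUM_BYTES,
                                    SKETCH_PAGE_SIZE - SKETCH_SUM_BYTES);
}
```

      <p>
        If only the first 16KB of a newly sealed page reaches the
        disk, the header carries the new checksum while the second
        half still holds old data, so page_verify reports a mismatch
        when the page is next read.
      </p>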
      <p> 
        For these reasons, it may be important to select hardware
        that does not do partial writes and does not cache data writes
        (or does not report that the data has been written to stable
        storage until it has either been written to stable storage or
        the actual writing of all of the data is guaranteed, barring
        catastrophic hardware failure such as your disk drive
        exploding).
    </p>
      <p> 
        If the disk drive on which you are storing your databases
        explodes, you can perform normal Berkeley DB catastrophic
        recovery, because it requires only a snapshot of your
        databases plus the log files you have archived since that
        snapshot was taken. In this case, you should lose no
        database changes at all. 
    </p>
      <p> 
        If the disk drive on which you are storing your log files
        explodes, you can also perform catastrophic recovery, but you
        will lose any database changes made as part of transactions
        committed since your last archival of the log files.
        Alternatively, if your database environment and databases are
        still available after you lose the log file disk, you should
        be able to dump your databases. However, you may see an
        inconsistent snapshot of your data after doing the dump,
        because changes that were part of transactions that were not
        yet committed may appear in the database dump. Depending on
        the value of the data, a reasonable alternative may be to
        perform both the database dump and the catastrophic recovery
        and then compare the databases created by the two methods. 
    </p>
      <p> 
        For these reasons, storing your databases and
        log files on different disks should be considered a safety
        measure as well as a performance enhancement.
    </p>
      <p> 
        Finally, you should be aware that Berkeley DB does not
        protect against all cases of stable storage hardware failure,
        nor does it protect against simple hardware misbehavior (for
        example, a disk controller writing incorrect data to the
        disk). However, configuring the database for checksums will
        ensure that any such corruption is detected. 
    </p>
    </div>
  </body>
</html>
